1 INTRODUCTION


Embedded computing systems are becoming essential to many critical applications, such
as flight control and life support systems. To meet the requirement of uninterruptible
service during each mission, such a system is often built with multiple computing channels so
that faults can be masked during normal operation. Each of these redundant channels is
physically independent of others and is composed of a complete set of processors, memory,
and other control circuits, so that each fault-free channel can accomplish the functions
assigned to the original set of redundant channels. An N modular redundant (NMR) system
can mask up to $\lfloor (N-1)/2 \rfloor$ faults by tightly synchronizing, and voting on, the
operations of N channels. A channel is said to have error(s) when (physical) faults within
the channel manifest themselves, generating outputs different from those generated by a
fault-free channel.

One key design consideration of NMR systems is the frequency of voting on channels’ 
outputs. Increasing the voting frequency can improve fault detection capability but may 
degrade system performance. Different designs of voters can be found in the Fault-Tolerant 
Multiprocessor (FTMP) [1] and the Fault-Tolerant Processor (FTP) [2], both of which were
intended for life-critical real-time applications like flight control for commercial airplanes. In 
FTMP every common memory access requires voting 1 , whereas in FTP votes are taken only 
on certain data. Thus, by its less intrusive voting, FTP can provide better performance 
than FTMP. The reliability and fault recovery problems of FTP are explored through 
a case study for an unmanned reusable launching system called the Advanced Launch 
System (ALS). ALS is a typical application of a class of architecture called the Advanced 
Information Processing Systems (AIPS), which is currently being developed by Charles 
Stark Draper Laboratory (CSDL) under the support of the National Aeronautics and Space 
Administration (NASA). 

FTP voter design imposes few constraints on the architecture but is relatively slow
when compared to the instruction cycles of contemporary microprocessors, i.e., it takes
several steps to vote on a message/data, and each step takes approximately 5 μseconds. This
in turn implies that FTP’s channel outputs should not be voted on frequently. As embedded
systems are becoming increasingly complex, one must carefully investigate the dynamics of
system failure for life-critical applications with long mission times.

The first important problem to be addressed in this report is the probability of system 
failure due to nearly-coincident faults in FTP. By developing a realistic system model, we
shall show this probability to be negligible. Then, we analyze the probability of resource 
exhaustion 2 for applications with long mission times. A serious drawback of low-speed 
(slack) voters is that when channels have large main memory, it is very time-consuming to 
re-align all channels with a slack voter into an identical state. 

To alleviate the difficulty associated with memory re-alignment, a monitoring technique 
using signature analysis was proposed in [3]. In this method, main memory is decomposed 
into signature pages, and memory accesses to each page are encoded into a signature which 
is then stored in an independent signature memory. A fault is thus detectable only when 
a faulty word is accessed. Upon detection of a fault in main memory, only those pages 
with different signatures need to be re-aligned. The signature analysis cannot completely 
overcome the memory re-alignment problem, because even though a massively redundant
system may have congruent inputs, errors caused by random permanent/transient faults 
that occur in memory cells cannot be detected by this technique. Thus, in the worst case, 
the whole main memory must be realigned when such faults occur. 

Channel failure rate has the most pronounced effects on system reliability, because it (i) 
determines when a resource exhaustion occurs, and (ii) affects the process of fault detection, 
fault location, and reconfiguration. When channel failure rate is high, the success of a 
mission will be greatly affected by the quality of fault handling processes. On the other 
hand, when channel failure rate is low, rare occurrences of faults will lower the demand on
system resources (e.g., hardware, computing time, and software routines) for fault handling.
System design would thus be greatly simplified if the channel failure rate can be effectively
reduced.

2 due to loss of resources as a result of failures

As a solution to both the reliability enhancement and memory re-alignment problems,
we propose to use channel error maskers (CEMs). The main motivation behind this
proposal is to make channels more reliable by masking channel faults with CEMs. When the
reliability of each channel is improved, the need for memory re-alignment can be reduced
significantly. It is shown that the reliability of ALS is dramatically improved when the
CEMs for main memory are implemented by common single error correction/double error
detection (SEC/DED) codes. Furthermore, using CEMs can speed up the memory
re-alignment process substantially, because only those faulty words not covered by CEMs need
to be recovered by the voter. Two different schemes, called Scheme_1 and Scheme_2,
respectively, are developed for the re-alignment of main memory. In Scheme_1, main
memory is decomposed into recovery pages, and a page is re-aligned only when it cannot be
recovered by CEMs. An optimization technique is developed to find the optimal page size
for Scheme_1. In Scheme_2, addresses of faulty memory words are recorded, and only
those recorded faulty words need to be re-aligned.

In order to assess the FTP’s capability for the ALS mission, the basic operational 
principles of FTP are first introduced in Section 2. In Section 3, we develop a reliability 
model which is then used to evaluate the FTP’s reliability for ALS. CEMs are then applied 
to solve the memory re-alignment problem in Section 4. The report concludes with a few 
remarks in Section 5. 

2 FTP Architecture 

The architecture of FTP, and the memory access model for the analysis of multiple 
channel faults, are introduced in this section. An FTP channel may have one or two 
processors. When two processors are built into each channel, one processor is dedicated to 
computation functions and the other to I/O functions. The two processors in a channel
communicate with each other by writing and reading messages in a shared memory. There 
are interval timers and watchdog timers in each channel for task scheduling and time-out 
interrupts. 

FTP can provide high performance, and its architecture can be in the form of simplex, 
duplex, triplex (TMR), or quadruplex (QMR). In a redundant FTP system, only clocks in 
the different channels of FTP are tightly synchronized, and thus, fault-free channels can 
execute identical instructions in lock-step. Channels communicate with each other via a 
network formed by the communicators in redundant channels. 

A block diagram of the communicator network in a QMR FTP is given in Fig. 1 where
a communicator is composed of a set of registers, a transmitter (single input, N outputs),
interstage (single input, N outputs), and a receiver (N inputs, single output). There are
four channels, called channel A, B, C and D, respectively. The set of registers $X_V$, $X_R$,
$X_E$, and $X_Y$ in channel Y, $Y \in \{A, B, C, D\}$, store inputs and outputs of channel Y
to/from the communicator network.

Logically, channels exchange data by reading/writing data from/to the set of
registers in the communicator. Data communications between channels are classified as voted
data-exchange, and simplex data-exchange. FTP design emphasizes the concept of source
congruency: for all types of data-exchanges, all correctly operating channels will eventually
receive identical copies of data. A voted data-exchange allows channels to compare their
outputs and mask any error whenever possible. A voted data-exchange is accomplished by
writing a value to $X_V$, and then reading the voted result from $X_R$. Register $X_E$ in each
channel records any discrepancy between its $X_V$ and $X_R$ values. The actual steps in a voted
data-exchange are that (1) every channel sends a message to its transmitter which will then
relay the message to its own interstage, and (2) through the fully-connected network from
N interstages to N receivers, the receiver in every channel gets a voted message and stores
it in $X_R$.





A simplex data-exchange can be used by a channel to broadcast messages to the other
channels. For example, if a message needs to be transferred from channel A to the others, all
channels execute an instruction “write message $\Phi$ to $X_A$”. When the instruction “write
$\Phi$ to $X_A$” is executed by channel A, $\Phi$ will be broadcast via the transmitters to all
interstages. In the meantime, the pseudo-messages $\Phi$ sent by channels B, C and D are
discarded by the communicator network. After all interstages receive replicated copies of
$\Phi$, $\Phi$ is broadcast on the interstage network, and every receiver will have the voted
$\Phi$ stored in $X_R$. Through such an exchange process, data congruency is guaranteed for
both voted and simplex data-exchanges.
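
The two exchange types can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the function names, the dictionary-based channel representation, and the strict-majority rule are illustrative choices and are not taken from the FTP hardware design.

```python
from collections import Counter


def vote(values):
    """Return the strict-majority value among the copies, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def voted_exchange(written):
    """Model of a voted data-exchange.

    `written` maps a channel name to the value that channel wrote to X_V.
    Returns the voted result every channel reads from X_R, plus a per-channel
    discrepancy flag playing the role of X_E.
    """
    x_r = vote(list(written.values()))
    x_e = {ch: int(v != x_r) for ch, v in written.items()}
    return x_r, x_e


def simplex_exchange(written, source):
    """Model of a simplex data-exchange from `source`.

    Only the source channel's message is relayed; the pseudo-messages written
    by the other channels are discarded. Each interstage then rebroadcasts its
    copy and every receiver votes on the copies it gets.
    """
    interstage_copies = [written[source]] * len(written)
    return vote(interstage_copies)


x_r, x_e = voted_exchange({"A": 7, "B": 7, "C": 9, "D": 7})
print(x_r, x_e)
print(simplex_exchange({"A": "msg", "B": None, "C": None, "D": None}, "A"))
```

In this model a 2-2 split in a QMR configuration yields no majority, which is why `vote` returns `None` rather than picking a side.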

3 Reliability Analysis 

To justify the use of a single fault model, Section 3.1 presents a preliminary examination 
of system failures due to nearly-coincident faults when the system uses memory coding and 
segmentation. Then, using the single fault model, the probability of system failure due to 
exhaustion-of-parts, and the reliability impact of CEMs on the ALS will be analyzed in 
Section 3.2. 

3.1 Probability of Nearly-coincident Faults

A complete reliability model of FTP must include both sequential and nearly-coincident 
faults. However, incorporation of nearly-coincident faults into the reliability model will 
make the analysis very complex, because it deals with a multivariate distribution. To
alleviate the complexity, calculation of the probability of nearly-coincident faults is separated
from that of the probability of system crash caused by resource depletion. 

Since a channel’s operation depends heavily on its access to main memory, a memory 
access model needs to be developed for the evaluation of nearly-coincident faults. The 
existence of memory access locality has been the key to the modeling of memory access 
behavior. That is, once a program starts to access a specific memory area, it tends to
access the area continuously for a certain period. Thus, although memory cells are physically 
identical, different parts of main memory must be distinguished when they have different 
logical uses. 

Using memory access locality, a program’s memory access behavior can be modeled by 
an active agent visiting main memory and forming access sets [4]. An access set is defined 
as a memory area that is continuously accessed for a certain period of time during each visit. 
We further assume that (1) locations of all access sets in the system do not change, (2) the 
number of access sets in the system is fixed, (3) all access sets have the same size u and are 
disjoint with one another, and (4) memory segmentation and single-error-correction and 
double-error-detection (SEC/DED) codes are implemented in the memory, so that latent 
faults are covered by a background scrubbing process and the SEC/DED codes when the 
memory is not accessed. A faulty memory segment is replaced at the end of the current 
visit of the active agent. Optimization of the memory segment replacement procedure is 
discussed in Section 4. Based on these assumptions, an access set can be used to denote a 
set of “physical” memory cells in a certain area. 

There are m access sets $\{AS^1, AS^2, \cdots, AS^m\}$ in main memory. Let $V_j^i$ denote the
event of the agent’s j-th visit to $AS^i$ and $M_j^i \in \Re^+ \cup \{0\}$ be the lifetime of
$V_j^i$. The $M_j^i$’s are assumed to be independent and identically distributed (i.i.d.)
random variables, and the agent’s present and future visits are independent of its past
visits. Based on this memory access model, a nearly-coincident fault occurs when more than
one agent (i.e., channel processor) either becomes faulty or visits a faulty access set during
one inter-voting period.

To calculate the probability of nearly-coincident faults, it is assumed that memory is 
free of faults initially and is not re-aligned when the voter can mask faulty channel outputs. 
Voters are assumed to be the only fault detection/masking mechanism in the system. Once 
the processor enters an access set, any of the faults in the processor, voter, or the access 
set will cause a faulty output on the channel. Note that faults in one access set do not 
propagate to another access set. For convenience of presentation, all processor and voter
faults are classified as access set faults. Since the component failure rate is very low as 
compared to memory access times, a large number of access sets will be visited between 
two successive fault occurrences, and no new fault is likely to occur in an access set during 
a visit to the access set. Faults occur independently according to a Poisson process. 

A system with m access sets is said to be in state i when the agent is in $AS^i$. Let $S_k^i$
be the time $V_k^i$ begins, i.e., the start of the agent’s k-th visit to access set $AS^i$, and
let $N^i(t) = \sup\{k \mid S_k^i < t\}$, $N^i(t) \in I^+ \cup \{0\}$, $\forall i, t$. For a given
time interval $[0, t)$, $t > 0$, the total number of visits made by the agent to access sets is
$N(t) = \sum_{k=1}^{m} N^k(t)$. One or more access sets may have been visited before the
channels vote on their outputs. Let the random variable $X_i^j \in I^+ \cup \{0\}$ denote the
number of faults that occurred in $AS^j$ during the agent’s i-th visit (to some access set).
Since faults occur uniformly within memory and the $X_i^j$’s are assumed to be i.i.d., they
will be represented by a single random variable X. Thus, at any time instant, the agent’s
decision on which access set to visit makes no difference. Let $Y_i = T_i - T_{i-1}$ where
$T_j$ is the time of the j-th voting on channel outputs. When the $Y_i$’s are i.i.d., they can
be represented by a single random variable Y.

Assuming that the agent has visited $l$ access sets during $[0, T_{j-1})$, at time $T_{j-1}$
there are $lX$ faults in the $AS^i$’s. In an NMR system, let $P_c(Y_i)$ be the probability of a
channel generating a faulty output during the time interval $[T_{i-1}, T_i)$. During
$[0, T_j)$, the total probability of system crash due to nearly-coincident faults becomes

$$P_N(T_j) = \sum_{k=1}^{j} \sum_{i=\lceil N/2 \rceil}^{N} \binom{N}{i} \Pr(\text{no failure before } T_{k-1}) \, (P_c(Y_k))^i (1 - P_c(Y_k))^{N-i} \tag{3.1}$$

$$\le \sum_{k=1}^{j} \sum_{i=\lceil N/2 \rceil}^{N} \binom{N}{i} (P_c(Y_k))^i (1 - P_c(Y_k))^{N-i}. \tag{3.2}$$

To evaluate $P_c(Y_j)$, within $[T_{j-1}, T_j)$ the probability of a faulty output generated by
a channel is

$$P_c(Y_j \mid w \text{ visits between two successive votings}) = 1 - (P(X = 0))^w = 1 - e^{-wu\lambda Y} \approx wu\lambda Y, \tag{3.3}$$

where $u$ and $\lambda$ are the size of access set and the failure rate of a memory word,
respectively. When $w = 1$, i.e., channels vote on their results after accessing each access
set, Eq. (3.2) can be simplified as

$$P_N(T_j) \le j \sum_{i=\lceil N/2 \rceil}^{N} \binom{N}{i} (u\lambda Y)^i (1 - u\lambda Y)^{N-i}. \tag{3.4}$$
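
The bound of Eq. (3.4) is easy to evaluate numerically. The sketch below assumes the sum starts at $\lceil N/2 \rceil$ faulty channels (the smallest number a majority vote cannot mask) and uses hypothetical values for the access set size and the per-word failure rate; only the 40 msecond voting period comes from the text.

```python
from math import ceil, comb


def coincident_fault_bound(n_channels, u, lam, y, j):
    """Upper bound in the style of Eq. (3.4).

    n_channels: N, number of redundant channels
    u:          access set size in words (hypothetical value below)
    lam:        failure rate of a memory word, per hour (hypothetical value below)
    y:          inter-voting period Y, in hours
    j:          number of voting periods elapsed
    """
    p = u * lam * y  # per-channel, per-period fault probability (Eq. 3.3 with w = 1)
    per_period = sum(
        comb(n_channels, i) * p**i * (1 - p) ** (n_channels - i)
        for i in range(ceil(n_channels / 2), n_channels + 1)
    )
    return j * per_period


Y_HOURS = 0.040 / 3600                 # 40 msecond voting period
PERIODS_PER_WEEK = round(7 * 24 / Y_HOURS)
print(coincident_fault_bound(4, u=4096, lam=1e-9, y=Y_HOURS, j=PERIODS_PER_WEEK))
```

For small $u\lambda Y$ the QMR sum is dominated by the two-fault term, $6(u\lambda Y)^2$ per voting period.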


3.2 Analysis of ALS 

In this subsection, the probability of system crash due to nearly-coincident faults, and 
the effectiveness of CEMs are discussed using the ALS mission scenario. The ALS will first 
sit on the launching pad for a week, and will then be in the boost phase for 10 minutes. Any 
approval for launch requires the system to have fault masking capability. The system must 
have 0.95 probability of availability, 0.98 probability of mission success, and less than $10^{-5}$
system unreliability at the end of mission. Since information on the maintenance schedule 
and the requirement for mission success are not available, we will focus on system reliability 
and the probability of system possessing fault masking capability before launch. 

The parameters necessary to estimate the reliability of ALS are derived from the results
of the Entry Research Vehicle (ERV) study. Permanent failure rates of the processors
(including control circuits) and the interstage are predicted to be
$\lambda_p = 8.91 \times 10^{-6}$/hour, and $\lambda_i = 1 \times 10^{-6}$/hour, respectively [5].
Permanent failure rates of 64K × 4 RAM chips and 128K × 8 ROM chips are predicted to be
$6 \times 10^{-6}$/hour, and $2.8 \times 10^{-6}$/hour, respectively. A redundant FTP equipped
with 1M bytes each of ROM and RAM in each channel is considered as an example ALS controller.
Thus, the main memory needs 32 (8) RAM (ROM) chips, and the total failure rate of RAM (ROM)
is $\lambda_m = 192 \times 10^{-6}$/hour ($\lambda_o = 20.8 \times 10^{-6}$/hour). Note that
the above failure rates have been adjusted by the environment and quality factors, $\Pi_e$
and $\Pi_q$, i.e., $\lambda_x \leftarrow \Pi \lambda_x$, where $x \in \{p, i, o, m\}$, and
$\Pi = \Pi_e \Pi_q$. Since $\Pi_q = 0.5$ and $\Pi_e = 3$ in the ERV study, the actual
component failure rates are $\lambda_p = 5.94 \times 10^{-6}$/hour,
$\lambda_o = 13.86 \times 10^{-6}$/hour, $\lambda_i = 1 \times 10^{-6}$/hour, and
$\lambda_m = 128 \times 10^{-6}$/hour, respectively. With these parameter values, one can see
that in the ALS, 96% of the channel faults are caused by main memory faults. This can be
broken down to 86.8% of the faults due to RAM and 9.2% due to ROM.
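
The arithmetic behind these figures can be retraced in a few lines. This is a sketch only: the variable names are illustrative, and the interstage rate is left unscaled because the text quotes the same value before and after the adjustment.

```python
# Chip counts for 1M bytes of RAM built from 64K x 4 chips and
# 1M bytes of ROM built from 128K x 8 chips.
MEMORY_BITS = 1024 * 1024 * 8
ram_chips = MEMORY_BITS // (64 * 1024 * 4)
rom_chips = MEMORY_BITS // (128 * 1024 * 8)

# Predicted (adjusted) rates quoted in the text, in failures per 10^6 hours.
predicted = {"p": 8.91, "i": 1.0, "m": 192.0, "o": 20.8}
PI = 3 * 0.5  # Pi_e * Pi_q from the ERV study

# Undo the adjustment for every component except the interstage.
actual = {x: (r if x == "i" else r / PI) for x, r in predicted.items()}

lam_c = sum(actual.values())
memory_share = (actual["m"] + actual["o"]) / lam_c
ram_share = actual["m"] / lam_c

print(ram_chips, rom_chips)
print({x: round(r, 2) for x, r in actual.items()})
print(round(memory_share, 3), round(ram_share, 3))
```

With these inputs the memory share comes out slightly above 95%, in line with the dominant role of main memory stated above.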

The system cycle of FTP is 40 msec, within which all the essential control functions, 
including fault recovery processes, must be completed for the system to function acceptably. 
It takes about 11 μseconds to vote on one memory word: the processor reads a memory
word, votes on it, reads the result back from the voter, and then writes the voted word back 
to the memory. Because of the relatively low system failure rate, and the frequent memory 
scrubbing, it is reasonable to assume that the system is free of latent faults. 


Note that when a fault occurs in the access set that is currently being visited by the
agent, the fault cannot be detected/corrected by the scrubbing process, because the
scrubbing process possesses the lowest priority. In the FTP, computing channels vote on their
outputs at least every 40 mseconds, i.e., $T_i - T_{i-1} \le 40$ mseconds. Thus, given that no
fault occurs before $T_{i-1}$, and $T_{i-1} \ll 1/\lambda_a$, where $\lambda_a$ is the failure
rate of an access set (including processor, interstage, memory and the access set itself), the
probability of system crash due to nearly-coincident faults in an NMR system during
$Y_i = T_i - T_{i-1}$ is

$$P_c(Y_i) = \sum_{j=\lceil N/2 \rceil}^{N} \binom{N}{j} e^{-(N-j)\lambda_a T_i} \left(e^{-\lambda_a T_{i-1}} \lambda_a Y_i\right)^j \approx \sum_{j=\lceil N/2 \rceil}^{N} \binom{N}{j} e^{-N\lambda_a T_{i-1}} (\lambda_a Y_i)^j. \tag{3.5}$$


Within $[0, t)$, the total probability of system crash due to two channel errors is

$$P_{crash}(t) \le \frac{t}{Y} \binom{4}{2} (Y\lambda_a)^2 + O(h) \tag{3.6}$$

$$\le 6tY\lambda_a^2. \tag{3.7}$$


In Eq. (3.7), the probability that the system does not crash before t is not considered,
and $O(h)$ is the probability of 3 or more channels becoming faulty simultaneously. The
probability of system crash due to nearly-coincident faults in the FTP is plotted in Fig. 2.
This probability is shown to be very small even when the size of access set is very large.
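
A rough order-of-magnitude check supports this conclusion. The sketch below reads the leading term of Eq. (3.7) as $6tY\lambda_a^2$ for a QMR system and, as a deliberately pessimistic assumption, bounds $\lambda_a$ by the whole-channel failure rate obtained by summing the actual component rates quoted earlier; neither choice is claimed to reproduce Fig. 2.

```python
def crash_bound(t_hours, y_hours, lam_a):
    """Leading term of the QMR crash bound: 6 * t * Y * lam_a**2."""
    return 6 * t_hours * y_hours * lam_a**2


Y_HOURS = 0.040 / 3600      # 40 msecond voting period, in hours
LAM_A = 148.8e-6            # assumption: lam_a bounded by the whole-channel rate, per hour
ON_PAD_HOURS = 7 * 24       # one week on the launching pad

print(crash_bound(ON_PAD_HOURS, Y_HOURS, LAM_A))
```

Even with this pessimistic $\lambda_a$, the result is several orders of magnitude below the $10^{-5}$ unreliability target.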

After evaluating the probability of nearly-coincident faults, one can develop a
continuous-time Markov model for the reliability analysis of a QMR system due to resource
exhaustion. As shown in Fig. 3, states A, B, C, D and E are used to denote the conditions where
the system has four, three, two, one fault-free channels, and system crash, respectively. The
model can be modified for a TMR system with state A removed. In this model, $\lambda_t$
($\lambda_h$) is the failure rate of transient (hard) faults, $c$ is the recovery coverage of
transient faults, and $c_d$ is the reconfiguration coverage of a duplex configuration.
Assuming that a channel will be retired if any of its components becomes faulty, the total
failure rate of a channel is $\lambda_c = \lambda_p + \lambda_m + \lambda_o + \lambda_i$. (See
Appendix A for definitions of the $\lambda$’s.)
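
A chain of this kind can be explored numerically with a few lines of code. The sketch below is a simplified stand-in, not the model of Fig. 3: it assumes only hard faults at rate $\lambda_c$ per fault-free channel, perfect recovery from transients, and a duplex coverage $c_d$ that sends an uncovered fault in state C directly to crash. The transition structure and the forward-Euler integration are illustrative assumptions.

```python
from math import exp


def qmr_state_probabilities(lam_c, c_d, t_hours, step=0.1):
    """Forward-Euler integration of an assumed five-state chain A..E.

    A, B, C, D: four, three, two, one fault-free channels; E: system crash.
    """
    # rates[src] lists (dst, rate) pairs; this structure is an assumption.
    rates = {
        "A": [("B", 4 * lam_c)],
        "B": [("C", 3 * lam_c)],
        "C": [("D", 2 * lam_c * c_d), ("E", 2 * lam_c * (1 - c_d))],
        "D": [("E", lam_c)],
        "E": [],
    }
    p = {s: 0.0 for s in rates}
    p["A"] = 1.0
    for _ in range(round(t_hours / step)):
        flow = {s: 0.0 for s in rates}
        for src, edges in rates.items():
            for dst, rate in edges:
                moved = p[src] * rate * step
                flow[src] -= moved
                flow[dst] += moved
        for s in p:
            p[s] += flow[s]
    return p


probs = qmr_state_probabilities(lam_c=148.8e-6, c_d=0.9, t_hours=7 * 24)
print(probs["E"])

# With perfect duplex coverage the crash probability should approach the
# closed form (1 - exp(-lam_c * t))**4, which gives a sanity check.
print(qmr_state_probabilities(148.8e-6, 1.0, 7 * 24)["E"],
      (1 - exp(-148.8e-6 * 7 * 24)) ** 4)
```

Forward Euler is adequate here because the rates are tiny compared with the step size; a stiff solver or a matrix exponential would be the safer choice for general parameters.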

A similar, but more complicated, FTP reliability model has been developed by CSDL
[5]. In CSDL’s model, every component failure is considered to be an independent event,
and the system reconfiguration time is treated as a random variable with an exponential
distribution. Our model differs from CSDL’s in that (1) system states are defined by the
number of fault-free channels, (2) different component failures in one channel are aggregated
into one single event, because when component failures are memoryless, and reconfiguration
rates for different component failures are the same, the channel failure rate is the sum of
component failure rates, and (3) system reconfiguration is considered to be done
instantaneously, because it is usually done in one system cycle of 40 mseconds, i.e., at a rate
of 90,000/hour, which is extremely fast relative to faults’ inter-arrival times.

Next, we want to evaluate the effectiveness of CEMs. A channel with embedded CEMs
will be retired if CEMs become faulty. Thus, the channel failure rate becomes

$$\lambda_c = \lambda^e_p + \lambda^e_i + \lambda^e_o + \lambda^e_m + (1 - c_p)\lambda_p + (1 - c_i)\lambda_i + (1 - c_o)\lambda_o + (1 - c_m)\lambda_m,$$

where $c_p$, $c_i$, $c_o$, and $c_m$ ($\lambda^e_p$, $\lambda^e_i$, $\lambda^e_o$, $\lambda^e_m$)
are the coverage (failure rate) of CEMs for processors, interstage, ROM and RAM, respectively.
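
As a worked instance of this sum, the sketch below adds the CEM failure rates to the uncovered share of each component rate. The dictionary layout and function name are illustrative; the numbers assume CEMs on RAM and ROM only, each with the same failure rate as an interstage and a coverage of $1 - 2 \times 10^{-7}$, values that appear elsewhere in this report.

```python
def channel_rate_with_cems(rates, coverage, cem_rates):
    """Sum of CEM failure rates plus the uncovered share of each component rate."""
    return sum(
        cem_rates.get(x, 0.0) + (1 - coverage.get(x, 0.0)) * r
        for x, r in rates.items()
    )


# Actual component rates in failures per 10^6 hours (processor, interstage, ROM, RAM).
rates = {"p": 5.94, "i": 1.0, "o": 13.86, "m": 128.0}
coverage = {"o": 1 - 2e-7, "m": 1 - 2e-7}   # no CEMs for processor or interstage
cem_rates = {"o": 1.0, "m": 1.0}            # assumption: same as the interstage rate

with_cems = channel_rate_with_cems(rates, coverage, cem_rates)
without_cems = channel_rate_with_cems(rates, {}, {})
print(with_cems, without_cems)
```

Under these assumptions the channel failure rate is dominated by the processor and the maskers themselves rather than by memory.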

As mentioned in the beginning of this subsection, 96% of channel failures are due to 
main memory failures. Thus, adding CEMs to main memory can dramatically reduce the
channel failure rate. On the other hand, CEMs could be designed for processors (and control 
circuits), but this is more difficult and has little impact on system reliability, since only 4% 
of channel failures result from this portion of hardware. Consequently, the design of CEMs 
for processors will not be considered any further. Note that CEMs would be inefficient 
if they could not achieve high fault coverage during the mission. Assuming that CEMs 
for memory can correct w bit-errors out of an n-bit word, and faults in memory bits are 
independent of each other, one can derive the coverage of CEMs at time t as: 


$$c_m = c_o = \frac{\sum_{i=1}^{w} \binom{n}{i} \left(1 - e^{-\lambda t/n}\right)^i e^{-(n-i)\lambda t/n}}{1 - e^{-\lambda t}}, \tag{3.8}$$

where $\lambda$ is the failure rate of a memory word. For the FTP example, if we use 7 extra
bits to encode a 32-bit data word by SEC/DED codes, we get $\lambda$ = §§ × $10^{-9}$/hour-word,
and $c_m \approx c_o \approx 1 - 2 \times 10^{-7}$ at $t = 200$ hours. However, when
multiple-bit chips are used, other coding schemes should be employed [6] to provide high fault
coverage. Since the implementation of CEMs for main memory is straightforward with standard
commercial error controllers (e.g., 74ALS632B), they will not be discussed any further in this
report.

Evaluation of the reliability of a redundant system with CEMs is very simple when
the system has perfect reconfiguration capability. For example, consider two redundant
systems with N and W computing channels, respectively. CEMs are embedded into the
NMR system, denoted as NMR_CEM, but no CEM is embedded into the WMR system. Let
the channel failure rate of NMR_CEM (WMR) be $\lambda'_c$ ($\lambda_c$), and $N < W$, then the
probability of NMR_CEM (WMR) crash before time t is $P_N(t) = (1 - e^{-\lambda'_c t})^N$
($P_W(t) = (1 - e^{-\lambda_c t})^W$). When $\lambda'_c t \ll 1$ ($\lambda_c t \ll 1$),
$P_N(t) \approx (\lambda'_c t)^N$ ($P_W(t) \approx (\lambda_c t)^W$). Thus, an NMR_CEM is more
reliable than a WMR system when $\lambda'_c < \lambda_c \sqrt[N]{(\lambda_c t)^{W-N}}$. Note,
however, that a numerical method is usually called for when systems do not have perfect
reconfiguration capability.
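
The comparison can be tried out directly. The sketch below evaluates both crash probabilities and the threshold on the CEM-equipped channel rate for a TMR system with CEMs against a plain QMR system; the rate of $8.94 \times 10^{-6}$/hour for the CEM-equipped channel is an assumed illustrative value, and perfect reconfiguration is assumed throughout.

```python
from math import exp


def crash_probability(lam, t, channels):
    """Probability that all channels have failed by time t (perfect reconfiguration)."""
    return (1 - exp(-lam * t)) ** channels


def cem_rate_threshold(lam_c, t, n, w):
    """Largest lam' for which (lam' * t)**n < (lam_c * t)**w in the small-rate approximation."""
    return lam_c * (lam_c * t) ** ((w - n) / n)


LAM_PLAIN = 148.8e-6   # per hour, channel without CEMs
LAM_CEM = 8.94e-6      # per hour, assumed rate of a channel with memory CEMs
T = 7 * 24             # hours on the launching pad

print(cem_rate_threshold(LAM_PLAIN, T, n=3, w=4))
print(crash_probability(LAM_CEM, T, 3), crash_probability(LAM_PLAIN, T, 4))
```

Because the threshold scales with $(\lambda_c t)^{(W-N)/N}$, it tightens as the mission time shrinks, so the advantage of the smaller CEM-equipped system is hardest to achieve for very short missions.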

Using the component failure rates predicted by the ERV study, numerical solutions of
the ALS reliability with and without CEMs are calculated by METASAN™ [7]. Let the
failure rate of CEMs for memory be the same as that of an interstage, the coverage of
transient faults be 1, and the coverage of duplex system be 0.9, i.e., $c_p = c_i = 0$, the CEM
failure rate for memory equal to $\lambda_i$, $c = 1$, and $c_d = 0.9$. The probabilities of
system crash on the launch pad for TMR and QMR systems with and without CEMs are plotted in
Figs. 4 and 5. The two diagrams in Fig. 4 (5) show the reliability impacts of CEMs when
$\Pi = 0.1$ and $\Pi = 1$, respectively, where $\Pi$ is an adjusting factor of channel failure
rate. In Fig. 4, SEC/DED codes are embedded into RAM only, and in Fig. 5, SEC/DED codes are
embedded into both ROM and RAM. Clearly, a TMR system with the entire memory (ROM and RAM)
encoded is more reliable than a conventional QMR system even for very short missions and very
low component failure rates. Furthermore, while the reliability improvement by changing from a
conventional TMR to a QMR system is in the order of 10 to 100, when CEMs are embedded into
main memory, the reliability improvement by upgrading a system from TMR_CEM to QMR_CEM is in
the order of $10^3$ to $10^4$.

™METASAN is a registered trademark of the Industrial Technology Institute.

The probability of FTP retaining fault masking capability for the ALS is examined
next. As shown in Fig. 6, the probability that a conventional QMR system retains fault
masking capability decreases quickly with increases in $\Pi$ (i.e., component failure rates) and
launch waiting times. On the other hand, since channels in TMR_CEM or QMR_CEM are
inherently reliable, the probability of launch approval increases substantially even for very
long waiting times.

Finally, the total system reliability throughout the mission can be derived as follows. 
The system unreliability is the sum of the probability of system crash before launch, and the 
probability of system crash during the launch. Since the system cannot be launched unless 
the FTP has fault masking capability, we can calculate the probability of system crash 
during the boost period conditioned on that the FTP has fault masking capability. When 
the boost time is less than 20 minutes, the probability of system crash during the launch is 
lower than $10^{-7}$ for systems without CEMs, and the figures are much lower for systems with
CEMs. Thus, the probability of system crash during the on-pad waiting period is much 
higher than the probability of system crash during the boost phase. 

4 Memory Re-alignment


Application of CEMs to the memory re-alignment problem is the subject of this section.
In a conventional QMR system, the probability that the channels need to be re-aligned is
$P_r(t) = 1 - e^{-4\lambda_t t}$, where $\lambda_t$ is the transient failure rate of RAM. When
$\lambda_t = 128 \times 10^{-5}$, we get $P_r(200) \approx 0.64$, implying that memory faults
should be a serious concern to any system design.

Theoretically, when a transient fault occurs in memory, the fault can be corrected by
memory re-alignment. However, since it is very time-consuming to re-align channel
memories, and since it is difficult to discriminate permanent, intermittent, and transient faults
in a limited amount of time, it is highly desirable to correct faults, if possible, by CEMs
without using memory re-alignment. For example, when SEC/DED codes are embedded
into main memory, the transient failure rate is reduced by a factor of $2 \times 10^{-7}$.
Plugging the new failure rate into $P_r(t)$, we get $P_r(200) \approx 2 \times 10^{-7}$. Thus,
for the ALS mission scenario, channels’ main memory re-alignment is unlikely to be called for
when CEMs are embedded into main memory.
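
Both values of $P_r(200)$ follow from the same expression, as the short check below illustrates. The function name is illustrative, and the rate is taken to be per hour, which the text does not state explicitly.

```python
from math import expm1


def realign_probability(lam_t, t_hours, channels=4):
    """P_r(t) = 1 - exp(-channels * lam_t * t), computed stably for tiny rates."""
    return -expm1(-channels * lam_t * t_hours)


LAM_T = 128e-5                      # transient failure rate of RAM (assumed per hour)
LAM_T_WITH_CEMS = LAM_T * 2e-7      # rate after SEC/DED masking

print(realign_probability(LAM_T, 200))
print(realign_probability(LAM_T_WITH_CEMS, 200))
```

`expm1` is used instead of `1 - exp(...)` so that the second, very small probability is not lost to floating-point cancellation.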

In addition to dramatically reducing the need for memory re-alignment, the fault-masking
capability of CEMs can be used to speed up the process of memory re-alignment
substantially. Two schemes, called Scheme_1 and Scheme_2, are developed for the re-alignment
of main memory. In Scheme_1, the entire memory space of W words is decomposed into
K recovery pages $\Omega_1, \Omega_2, \cdots, \Omega_K$, where $|\Omega_i| = W/K$,
$1 \le i \le K$. When the system decides to start memory re-alignment, all channels scan
through main memory page-by-page. After each page of different channels is scanned, channels
have the scanned page re-aligned if any one of them is found to be faulty. The procedure is
repeated until the entire memory system is completely scanned and/or re-aligned. When two
pages have different sizes, we can repeatedly subtract 1 byte from the page of larger size,
and add 1 byte to the other until the difference of their page size is less than, or equal to,
1. Thus, when $W/K$ is not an integer, there is at most one byte difference among pages. Since
the reliability difference and re-alignment overhead caused by the one-byte difference in page
size is negligible, it is assumed that K can always divide W without leaving a non-zero
remainder.

In Scheme_2, the entire memory is decomposed into $\Omega_1$ and $\Omega_2$, where
$\Omega_1$ is a fault register of variable size, and $\Omega_2$ is the rest of main memory.
When main memory needs to be re-aligned, the CPU in each channel scans through its main memory
and places addresses of faulty words into its fault register. After all channels complete
their memory scan, they use simplex data-exchanges to broadcast addresses of faulty words, and
then vote on each faulty word using the voted data-exchange.

Details of Scheme_1 and Scheme_2 are described in pseudocode as follows.

Scheme_1(channel_i)
begin
    Synchronize channels to start the re-alignment
    n = 1;
    while (n ≤ K)    /* scan recovery pages; K is the number of pages */
    do
        A = “fault-free”;    /* the current page is assumed to be fault-free */
        scan Ω_n;
        if (Ω_n faulty & cannot be corrected by CEMs) A = “faulty”;
        write A to X_V;
        if (X_R = “faulty” or X_E ≠ 0)    /* at least one channel has a faulty page */
        do    /* re-align Ω_n */
            j = 1;
            while (j ≤ W/K)
            do
                write Ω_n(j) to X_V;
                write X_R to Ω_n(j);
                j = j + 1;
            end_do
        end_do
        n = n + 1;    /* advance to the next recovery page */
    end_do
end

Scheme_2(channel_i)
begin
    Synchronize channels to start the re-alignment
    j = 1;
    k = 1;
    while (j ≤ W)    /* scan main memory; W is the total memory size */
    do
        read M(j);    /* read the j-th word */
        if (M(j) faulty & cannot be corrected by CEMs)
        do
            write j to Ω_1(k);    /* found a faulty word; record its address in the fault register */
            k = k + 1;
        end_do
        j = j + 1;    /* advance to the next word */
    end_do
    write “EOF” to Ω_1(k);    /* channel_i finishes scanning */
    write “Ready” to X_V;
    while (X_E ≠ zero) write “Ready” to X_V;    /* wait until all channels finish scanning */
    for (n = A to D)    /* re-align faulty words one by one, starting from channel A to D */
        k = 1;    /* pointer of channel n */
        while (X_R ≠ “EOF”)
        do
            write Ω_1(k) to X_n;    /* only channel_n can make a simplex data-exchange;
                                       other channels’ write commands will be ignored by the system */
            read X_R;    /* every channel reads the address of the faulty word in channel_n */
            if (X_R ≠ “EOF”)
            do
                T = X_R;    /* X_R contains the address of the faulty word */
                write M(T) to X_V;    /* channels vote on the faulty word */
                write X_R to M(T);    /* channels write the voted result back */
            end_do
            k = k + 1;
        end_do
end
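
The behavior of the two schemes can be compared with a small simulation. The sketch below is a software model under simplifying assumptions, not the FTP implementation: channel memories are Python lists, the set of addresses each channel's CEMs could not correct is given as input, voting is a simple majority, and only voted data-exchanges are counted (the simplex broadcasts of Scheme_2 are not). All function names are illustrative.

```python
from collections import Counter


def vote(copies):
    """Majority value among the channel copies (ties resolved arbitrarily)."""
    return Counter(copies).most_common(1)[0][0]


def scheme_1(memories, flagged, k_pages):
    """Page-wise re-alignment; returns the number of voted data-exchanges.

    memories: one list of words per channel (modified in place)
    flagged:  per channel, the set of addresses its CEMs could not correct
    """
    page = len(memories[0]) // k_pages            # assumes K divides W
    votes = 0
    for n in range(k_pages):
        lo, hi = n * page, (n + 1) * page
        votes += 1                                # vote on the page status flag
        if any(lo <= a < hi for f in flagged for a in f):
            for j in range(lo, hi):               # re-align every word of the page
                voted = vote([m[j] for m in memories])
                for m in memories:
                    m[j] = voted
                votes += 1
    return votes


def scheme_2(memories, flagged):
    """Word-wise re-alignment; only recorded faulty addresses are voted on."""
    votes = 0
    for fault_register in flagged:                # channels broadcast in turn
        for addr in sorted(fault_register):
            voted = vote([m[addr] for m in memories])
            for m in memories:
                m[addr] = voted
            votes += 1
    return votes


def corrupted_channels():
    """Four channels of 64 words with one uncorrectable word in two of them."""
    golden = list(range(64))
    mems = [golden[:] for _ in range(4)]
    mems[1][5] = -1
    mems[2][40] = -1
    return golden, mems, [set(), {5}, {40}, set()]


golden, mems, flagged = corrupted_channels()
print(scheme_1(mems, flagged, k_pages=8), all(m == golden for m in mems))
golden, mems, flagged = corrupted_channels()
print(scheme_2(mems, flagged), all(m == golden for m in mems))
```

In this example Scheme_1 spends one vote per page plus one per word of each faulty page, while Scheme_2 spends one vote per faulty word, which mirrors the trade-off between the two schemes.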

Scheme_1 is more robust than Scheme_2, because in Scheme_1 all channels are
executing identical instructions in lock step, and any mismatch between channels can be
easily detected. Thus, fault-free channels can always complete memory re-alignment without
being affected by faulty channels. On the other hand, Scheme_2 is faster but more prone
to errors, because the completion of memory re-alignment can be guaranteed only when
faulty channels can correctly interact with fault-free channels. For example, if the CPU
program counter in one channel stops at a certain point, all the other channels running
Scheme_2 will be stuck in waiting loops. Although this problem can be easily fixed by
adding a time-out to each waiting loop, Scheme_2 needs a substantial modification to
make it robust.
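
As a rough illustration of the time-out remedy mentioned above, the following Python sketch bounds a waiting loop. The function name `wait_until`, the polling interval, and the use of a wall-clock deadline are assumptions for the example; a real channel would poll its exchange registers rather than call a Python function.

```python
import time


def wait_until(all_ready, timeout_s, poll_s=0.001):
    """Poll all_ready() until it returns True or timeout_s elapses.

    Returns True if the condition was met and False on time-out, so the
    caller can treat a silent channel as faulty instead of spinning forever.
    """
    deadline = time.monotonic() + timeout_s
    while not all_ready():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

A time-out only prevents the fault-free channels from hanging; it does not by itself decide what to do with the channel that failed to respond, which is why a more substantial modification would still be needed.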

Both Scheme_1 and Scheme_2 incur a fixed overhead $W t_m$ to scan main memory,
where $t_m$ is the memory cycle time. (Since it is unimportant to the optimization problem
to be discussed, this fixed overhead will not be mentioned in the rest of this section.) The
performance overhead of Scheme_2 is linearly proportional to the total number of faults,
whereas Scheme_1 may be substantially slower than Scheme_2: Scheme_1 is faster
than Scheme_2 only when $g > K + m\frac{W}{K}$, where $g$ is the total number of faults and $m$
is the number of re-aligned recovery pages, because the value of $K$ in Scheme_1 can be
greater than the value of $g$ in Scheme_2.

The speed of Scheme_1 is primarily determined by the size of the recovery page and system
reliability. Denote the number of recovery pages to be re-aligned by a random variable $F$,
$0 \le F \le K$. Then, the memory re-alignment time is

$$t_{ra} = \left(K + F\frac{W}{K}\right) t_v,$$

where $t_v$ is the time to take a vote, i.e., the total time to write Xy, and read Xr and
Xe. From Eq. (3.8) it is not difficult to see that the perfect fault detection assumption is
reasonable even when the channel failure rate is high and CEMs have only fault detection
capability, e.g., even/odd parity codes. When CEMs have only perfect detection capability,
the probability that $f$ faulty recovery pages have occurred in the system by time $t$ is
$P_K(F = f) = \binom{K}{f} R_p^{K-f}(t)\,(1 - R_p(t))^f$, where $1 - R_p(t)$ is the probability that one or
more of the recovery pages which are in different channels but have the same page number
are faulty. Let $\lambda$ and $q$ denote the failure rate of a memory word and the number of
redundant channels, respectively; then $R_p(t) = e^{-q\lambda t W/K}$. Let $\psi = Wq\lambda t$, and let $K$ be
given; then the conditional probability of $f$ recovery pages needing to be re-aligned is

$$P_K(t_{ra}) = P\left(t_{ra} = K + f\frac{W}{K} \,\Big|\, \text{memory is faulty}\right) = \frac{\binom{K}{f}\, e^{-\psi (K-f)/K}\,\left(1 - e^{-\psi/K}\right)^f}{1 - e^{-\psi}}. \qquad (4.1)$$
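
The page-fault distribution and its conditional form in Eq. (4.1) are easy to evaluate numerically. The sketch below is a hedged illustration: the function names are made up for the example, and $\psi = 1.8$ and $K = 500$ are sample values taken from the numerical example later in this section. It checks that the conditional probabilities over $f = 1, \ldots, K$ sum to one.

```python
from math import comb, exp


def p_faulty_pages(f, K, psi):
    """P_K(F = f): binomial probability that f of K recovery pages are faulty."""
    r_p = exp(-psi / K)  # R_p(t) = e^{-psi/K}, with psi = W*q*lambda*t
    return comb(K, f) * r_p ** (K - f) * (1.0 - r_p) ** f


def p_realign(f, K, psi):
    """Eq. (4.1): the same probability, conditioned on the memory being faulty."""
    return p_faulty_pages(f, K, psi) / (1.0 - exp(-psi))


psi = 1.8   # sample value from the example in the text
K = 500     # sample number of recovery pages
total = sum(p_realign(f, K, psi) for f in range(1, K + 1))
```

The normalizing factor $1 - e^{-\psi}$ is simply $1 - P_K(F = 0)$, which is why the sum starts at $f = 1$. For much larger $K$ the binomial coefficient no longer fits in a float, and a log-space evaluation would be needed.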

The objective of recovery page design is to minimize the re-alignment overhead so that
at time $t$, the probability of re-alignment requiring more than a time period $T$ is less than
$\epsilon$. Therefore, a solution $K$ is feasible when $P_K(t_{ra} > T) < \epsilon$. The optimization problem is
essentially a non-linear integer programming problem, and can be stated formally as follows:

$$\min\; Z(t) = K + F\frac{W}{K}$$

subject to

$$K \in I^+, \quad K \le W,$$

$$P_K\left(t_{ra} = K + f\frac{W}{K}\right) = \frac{\binom{K}{f}\, e^{-\psi(1 - f/K)}\,\left(1 - e^{-\psi/K}\right)^f}{1 - e^{-\psi}},$$

$$P_K(t_{ra} > T) < \epsilon.$$
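
The feasibility constraint can be checked directly for a candidate $K$. The following Python sketch is one possible way to do so, under stated assumptions: the time budget is expressed as a number of votes (that is, $T / t_v$), the probabilities are evaluated in log space to avoid overflow for large $K$, and all function names are hypothetical.

```python
from math import exp, expm1, lgamma, log


def log_pmf(f, K, psi):
    """log P_K(F = f) for the binomial page-fault model."""
    log_c = lgamma(K + 1) - lgamma(f + 1) - lgamma(K - f + 1)
    return log_c - psi * (K - f) / K + f * log(-expm1(-psi / K))


def prob_exceeds(K, W, psi, budget_votes):
    """P_K(t_ra > T | memory is faulty), with T given as a number of votes."""
    norm = -expm1(-psi)  # P(memory is faulty) = 1 - e^{-psi}
    tail = sum(
        exp(log_pmf(f, K, psi))
        for f in range(1, K + 1)
        if K + f * W / K > budget_votes
    )
    return tail / norm


def is_feasible(K, W, psi, budget_votes, eps):
    """True when K is a valid page count and the time constraint is met."""
    return 1 <= K <= W and prob_exceeds(K, W, psi, budget_votes) < eps


W, psi, eps = 4_000_000, 1.8, 1e-5   # sample values from the later example
ok_loose = is_feasible(6320, W, psi, 13000, eps)
ok_tight = is_feasible(6320, W, psi, 12000, eps)
```

With these sample numbers a budget of 13000 votes leaves only the event of eleven or more faulty pages outside the budget, whereas a budget of 12000 votes already excludes nine faulty pages, which is far more likely than $\epsilon$.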

When $T > W t_v$, the recovery page design is trivial, because the memory can be easily
re-aligned by voting on every word. When $T < W t_v$, and the recovery coverage of CEMs
is $c$, no solution can be found if the optimal page size based on the given $c$ is not feasible.
Since the coverage of CEMs is very high, the design problem can be focused on page size
optimization, while the feasibility problem can be easily solved by an exhaustive search.
When $K^*$ is not much greater than 1, an exhaustive search for $K^*$ is the only course to take.
On the other hand, when $K^* \gg 1$, it will be shown that $K^*$ can be found through a
conventional continuous variable optimization technique.

Lemma 1: Given $\psi$ and $K$, $P_K(F = f)$, the probability of $f$ faults simultaneously
occurring to the system, is a monotonically decreasing function of $f$ when
$\frac{K-f}{f+1}\left(e^{\psi/K} - 1\right) < 1$, $1 \le f < K$. The sufficient condition for $P_K(F = f)$ to be a
monotonically decreasing function of $f$ is $\left(e^{\psi/K} - 1\right) < \frac{2}{K-1}$, $K > 1$.

Proof: Since $P_K(F = f) > 0$, $P_K(F = f)$ is monotonically decreasing if
$\frac{P_K(F = f+1)}{P_K(F = f)} < 1$, $\forall f$. Using Eq. (4.1), we have
$\frac{P_K(F = f+1)}{P_K(F = f)} = \frac{K-f}{f+1}\left(e^{\psi/K} - 1\right)$, or $P_K(F = f)$ is monotonically
decreasing if $\frac{K-f}{f+1}\left(e^{\psi/K} - 1\right) < 1$. Note that $0 < \left(e^{\psi/K} - 1\right) < 1$ when $\psi/K < 0.693$.
Since $\frac{K-f}{f+1}$ decreases as $f$ increases, and its maximum value for $f \ge 1$ is $\frac{K-1}{2}$, $K > 1$,
the sufficient condition for the ratio test to hold is $\left(e^{\psi/K} - 1\right) < \frac{2}{K-1}$, $K > 1$. ■
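
The ratio used in the proof of Lemma 1 can be written out step by step from the binomial form of $P_K(F = f)$. The normalizing factor $1 - e^{-\psi}$ of Eq. (4.1) cancels in the quotient, so it is omitted here:

```latex
\begin{align*}
\frac{P_K(F=f+1)}{P_K(F=f)}
  &= \frac{\binom{K}{f+1}}{\binom{K}{f}}
     \cdot \frac{e^{-\psi (K-f-1)/K}}{e^{-\psi (K-f)/K}}
     \cdot \left(1 - e^{-\psi/K}\right) \\
  &= \frac{K-f}{f+1}\; e^{\psi/K}\left(1 - e^{-\psi/K}\right)
   = \frac{K-f}{f+1}\left(e^{\psi/K} - 1\right).
\end{align*}
```

Because $\frac{K-f}{f+1}$ is largest at $f = 1$, where it equals $\frac{K-1}{2}$, requiring $\frac{K-1}{2}\left(e^{\psi/K} - 1\right) < 1$ is enough to make the ratio smaller than one for every $f \ge 1$.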

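Lemma 2: When the sufficient condition of Lemma 1 holds,
$P\left(t_{ra} > K + f\frac{W}{K}\right) < P_K(F = f)\,\frac{1 - \mu_f^{K-f+1}}{1 - \mu_f}$, where $\mu_f = \frac{K-f}{f+1}\left(e^{\psi/K} - 1\right)$.

Proof: When the sufficient condition of Lemma 1 holds, $P_K(F = f+1)/P_K(F = f) = \mu_f$
and $\mu_f < 1$. Since $\mu_{f+1} < \mu_f$, $\forall f$, we have

$$\sum_{i=f}^{K} P_K(F = i) < P_K(F = f)\left(1 + \mu_f + \mu_f^2 + \cdots + \mu_f^{K-f}\right),$$

or $P\left(t_{ra} > K + f\frac{W}{K}\right) < P_K(F = f)\,\frac{1 - \mu_f^{K-f+1}}{1 - \mu_f}$. ■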
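
The geometric-series bound of Lemma 2 can be compared with the directly summed tail. The Python sketch below does this for one sample setting ($K = 200$, $\psi = 1.8$, both chosen only for illustration); the function names are hypothetical.

```python
from math import comb, exp, expm1


def pmf(f, K, psi):
    """P_K(F = f) for the binomial page-fault model."""
    r_p = exp(-psi / K)
    return comb(K, f) * r_p ** (K - f) * (1.0 - r_p) ** f


def mu(f, K, psi):
    """Ratio P_K(F = f + 1) / P_K(F = f) in closed form."""
    return (K - f) / (f + 1) * expm1(psi / K)


def tail_bound(f, K, psi):
    """Lemma 2 bound: P_K(F = f) * (1 - mu_f^(K-f+1)) / (1 - mu_f)."""
    m = mu(f, K, psi)
    return pmf(f, K, psi) * (1.0 - m ** (K - f + 1)) / (1.0 - m)


def exact_tail(f, K, psi):
    """Sum of P_K(F = i) for i = f, ..., K."""
    return sum(pmf(i, K, psi) for i in range(f, K + 1))


K, psi = 200, 1.8
holds = all(exact_tail(f, K, psi) <= tail_bound(f, K, psi) for f in range(1, 16))
tightness = tail_bound(10, K, psi) / exact_tail(10, K, psi)
```

For small $f$ the ratio $\mu_f$ is close to one and the bound is loose; it becomes tight once $\mu_f$ is small, which is the regime that matters when $\epsilon$ is small.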

Note that <C K holds for most realistic parameter values. When Lemma 1 holds, and 
$K$ and $\epsilon$ are given, $f_K = \inf f$ such that $P\left(t_{ra} > K + f\frac{W}{K}\right) < \epsilon$ can be determined by
applying Lemma 2 repeatedly. That is, when the main memory consists of $K$ pages, the
number of faulty pages is less than or equal to $f_K$ with a probability greater than
$1 - \epsilon$. The next lemma states a key condition that can greatly simplify the optimization
algorithm.

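Lemma 3: If $K_1\,(K_2) \gg 1$, $K_1\,(K_2) \gg \psi$ and $K_1\,(K_2) \gg f$, then $P_{K_1}(F = f) \approx P_{K_2}(F = f)$,
where $P_{K_i}(f)$ is the probability of $F = f$ when the number of recovery pages is $K_i$.

Proof: $P_K(F = f) = \binom{K}{f}\, e^{-\psi(K-f)/K}\left(1 - e^{-\psi/K}\right)^f$. When $K \gg f$, $\binom{K}{f} \approx \frac{K^f}{f!}$, and
$e^{-\psi(K-f)/K} \approx e^{-\psi}$. Furthermore, when $K \gg \psi$, we get $1 - e^{-\psi/K} \approx 1 - \left(1 - \frac{\psi}{K}\right) = \frac{\psi}{K}$.
Combining the above expressions leads to $P_K(F = f) \approx \frac{K^f}{f!}\, e^{-\psi} \left(\frac{\psi}{K}\right)^f$, or
$P_K(F = f) \approx \frac{\psi^f e^{-\psi}}{f!}$. That is, $P_K(F = f)$ is predominantly determined by $f$, and is
insensitive to $K$. Thus, $P_{K_1}(F = f) \approx P_{K_2}(F = f)$ holds. ■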
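
Lemma 3 says that, for large $K$, the page-fault distribution approaches a Poisson distribution with mean $\psi$ and hardly depends on $K$. The sketch below checks this numerically for two page counts; the values $K = 500$, $K = 13500$ and $\psi = 1.8$ are used here only as sample inputs, and the log-space evaluation via `lgamma` is a choice made for the example so that large $K$ does not overflow.

```python
from math import exp, expm1, lgamma, log


def pmf(f, K, psi):
    """P_K(F = f), evaluated in log space so that large K does not overflow."""
    log_c = lgamma(K + 1) - lgamma(f + 1) - lgamma(K - f + 1)
    log_p = log_c - psi * (K - f) / K + f * log(-expm1(-psi / K))
    return exp(log_p)


def poisson(f, psi):
    """Limiting form psi^f * e^{-psi} / f! from the proof of Lemma 3."""
    return exp(f * log(psi) - psi - lgamma(f + 1))


psi = 1.8
gap_1 = abs(pmf(1, 500, psi) - pmf(1, 13500, psi))
gap_3 = abs(pmf(3, 500, psi) - pmf(3, 13500, psi))
gap_poisson = abs(pmf(3, 13500, psi) - poisson(3, psi))
```

The practical consequence is that the tail threshold $f_K$ can be computed once, from the Poisson form, without knowing the final page count.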

Lemma 3 is valid for a very broad range of $K$ values, and when $K_1, K_2 \gg 1$, the $P_{K_i}(F = f)$'s
are very close to each other. When Lemma 1 holds, $P_{K_1}(F = f_1) < P_{K_2}(F = f_2)$, where
$f_1 > f_2$. $P_K(F = f)$'s with different $K$ values are plotted in Fig. 7. In these examples,
the system has $W = 4$M words of memory, $q = 4$ channels, $\lambda = 0.75$ byte/$10^9$ hours, and
$t = 150$ hours. Thus, $\psi = \lambda t q W = 1.8$, $|P_{500}(F = 1) - P_{13500}(F = 1)| < 0.05$, and
$|P_{500}(F = 3) - P_{13500}(F = 3)| < 0.001$. Denoting the optimal value of $K$ by $K^*$, the most
desirable property of Lemma 3 is that when $K^* \gg 1$, we get $f_K \approx f_{K^*}$, and thus, $K^*$ can
be found by the following Theorem.

Theorem 1: When $K^* \gg 1$, $K^* \approx \sqrt{f_K W}$, where $K$ is the number of memory pages,
$1 \ll K \le W$.

Proof: From Lemma 3, we get $P_{K_1}(F = f) \approx P_{K_2}(F = f)$, $\forall f$. Thus, when $K^* \gg 1$,
we have $f_K \approx f_{K^*}$, or $f_{K^*}$ can be found by applying Lemma 2 to an arbitrary $K$ such
that $P(F > f_K) < \epsilon$. Clearly, for a given $\epsilon$, $f_K \approx f$, $\forall K \gg 1$, where $f$ is some constant.
The cost function $Z(t)$ to be minimized can be expressed as $\min\left(K + f\frac{W}{K}\right)$. Since the
objective function is convex when $K$ is continuous, the optimal solution of real-valued $K$'s
is $K' = \sqrt{f_K W}$. Then, $K^*$ can be found by an exhaustive search in $[K' - \delta, K' + \delta]$, where
$\delta$ is some constant yet to be found. ■
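
The continuous optimum in the proof of Theorem 1 follows from elementary calculus once $f$ is treated as a constant. Writing the cost as a function of a real-valued $K$:

```latex
\begin{align*}
Z(K) &= K + f\,\frac{W}{K}, \qquad
\frac{dZ}{dK} = 1 - \frac{fW}{K^2} = 0
\;\Longrightarrow\; K' = \sqrt{fW}, \\
\frac{d^2Z}{dK^2} &= \frac{2fW}{K^3} > 0 \quad (K > 0), \qquad
Z(K') = 2\sqrt{fW}.
\end{align*}
```

The positive second derivative confirms convexity for $K > 0$, so $K'$ is the unique minimizer of the relaxed problem, and the minimum cost is $2\sqrt{fW}$ votes.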

Given the probability, $\epsilon$, of the system failing to complete the re-alignment before time $t_r$,
we want to find the number of faulty pages, $f_K$, that can be re-aligned with an approximate
value of $K$. Since $f_K \approx f_{K^*}$ when $K^*$ is large, we derive a near-optimal page size $K'$,
and $K^*$ is then derived from $K'$ by an exhaustive search. An example cost function $Z(t)$
is plotted in Fig. 8. The curve shown in Fig. 8 is $K + f_K \lfloor \frac{W}{K} \rfloor$. The integral constraints on
$K$ and $\frac{W}{K}$ cause the sawtooth curve in $[K - \Delta K, K + \Delta K]$, but have only a small effect
on the global curve shape. In this example, $\epsilon = 10^{-5}$, $\psi = 1.8$, and thus $f_K = 10$. Thus,
$K' = \sqrt{10 \times 4 \times 10^6} = 6324.5$. (When $f_K = 40$, a similar requirement can be met, but the
number of pages is nearly doubled. This is because we want to reduce the page size (and
thus increase the number of pages) to reduce the total page re-alignment time.) Through an
exhaustive search, it is found that there are multiple optimal solutions, and the one closest
to $K'$ is 6320. The discrepancy between the result obtained from Theorem 1 and the exact
solution is due to the integral constraints on $K$ and $\frac{W}{K}$. Thus, having found $K'$, the optimal
solution can be easily found by using $K^* = \min K$ such that $\lfloor \frac{W}{K} \rfloor = \lfloor \frac{W}{K'} \rfloor$. However, from a practical
viewpoint, the difference between $K'$ and $K^*$ is less than 0.1%, and thus, it is reasonable
to use $K'$ rounded to an integer as an optimal solution.
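
The numbers in this example can be reproduced with a short script. The sketch below is an illustration under stated assumptions: the threshold $f_K$ is computed from the Poisson limit suggested by Lemma 3 rather than from the exact binomial tail, the search half-width `delta = 50` is an arbitrary choice (the text leaves $\delta$ unspecified), and all function names are hypothetical.

```python
from math import exp, floor, lgamma, log, sqrt


def poisson_tail(f, psi, terms=200):
    """Approximate P(F > f) using the Poisson limit of the page-fault model."""
    return sum(exp(i * log(psi) - psi - lgamma(i + 1))
               for i in range(f + 1, f + 1 + terms))


def smallest_f(psi, eps):
    """f_K: the smallest f with P(F > f) < eps."""
    f = 0
    while poisson_tail(f, psi) >= eps:
        f += 1
    return f


def cost(K, f, W):
    """Integer cost K + f * floor(W / K), in units of one vote."""
    return K + f * (W // K)


def best_K(f, W, delta=50):
    """Exhaustive search around K' = sqrt(f * W); delta is an assumed half-width."""
    k_real = sqrt(f * W)
    lo, hi = max(1, floor(k_real) - delta), min(W, floor(k_real) + delta)
    candidates = range(lo, hi + 1)
    best = min(cost(K, f, W) for K in candidates)
    ties = [K for K in candidates if cost(K, f, W) == best]
    return k_real, best, min(ties, key=lambda K: abs(K - k_real))


W, psi, eps = 4_000_000, 1.8, 1e-5
f_K = smallest_f(psi, eps)
k_real, best, k_star = best_K(f_K, W)
```

Under these assumptions the search finds several page counts with the same minimum cost, spaced ten apart, and picks the one nearest to $K'$, which agrees with the value quoted in the text.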

From the above example, we can see that even when CEMs have fault detection capability
only, the performance of Scheme_1 is nearly a thousand times better than voting on
every word. The performance will be further improved if CEMs also have fault recovery
capabilities. Using the example shown in Fig. 8, $W = 4 \times 10^6$ and $1 - C_c \approx 2 \times 10^{-7}$ for
SEC/DED codes, we get $f_K = 1$, and $K^* \approx 2000$, when $\epsilon$ remains the same ($10^{-5}$).

Cost functions for systems with and without SEC/DED codes are plotted in Fig. 9.
When the memory access time is 500 nanoseconds, it takes 2 seconds for a channel to scan
main memory. The total re-alignment time for systems without CEMs is 11 seconds. On
the other hand, when a fault occurs in a QMR_CEM system, with a probability greater
than $1 - 10^{-5}$, it will take less than 2.045 seconds to complete memory re-alignment.

5 Conclusion 

The reliability of redundant computing systems used for ALS is analyzed and some 
design issues are discussed. The concept of access set is used for the analysis of near- 
coincident channel faults leading to system crash. When fault arrivals are independent and 
the system is free of error propagation and latent faults, the probability of system crash due 
to multiple channel faults is dictated primarily by component failure rates. It is shown that 
with the state-of-the-art technology, the probability of system crash due to near-coincident 
channel faults is insignificant even when the system size is fairly large. 

The case study of ALS has shown that the chief cause of unreliability in large redundant 
systems is the depletion of hardware resources (as a result of component failures), especially 
when the system has a long mission time. It is worth mentioning that our evaluation of 
the effectiveness of CEMs in the ALS is very conservative, because all transient faults are 
assumed to be recoverable by NMR_CEM and conventional NMR systems. Since transient
faults are typically 10 times more frequent than permanent faults [8, 9], the reliability
improvement by using CEMs would be even greater when conventional systems do not have 
perfect recovery capability for transient faults. 

Although emerging new technologies continue to improve hardware reliability and perfor- 
mance, they also stimulate new applications which require higher reliability and computing 
power. Thus, as main memory is the most vulnerable system component for the current 
technology, it is expected to be the reliability bottleneck in future computing systems. For- 
tunately, the design of CEMs for main memory is very simple, and very high fault coverage 
can be achieved with low overhead. For the example discussed in this report, embedding
SEC/DED codes into main memory induced a memory overhead of about 22% for each
channel. By contrast, adding channels will increase overheads substantially more
in the power, physical size and channel synchronization of the system. Thus, embedding
SEC/DED codes into main memory is a much more cost-effective method to prolong the 
resource depletion time than adding more channels to the system. 

Large main memory coupled with slack voters makes memory re-alignment very
time-consuming. Thus, memory re-alignment in a large system should be avoided whenever
possible. It is shown in this report that CEMs can dramatically reduce the need for memory
re-alignment, and can speed up the re-alignment process substantially.

Another serious threat to memory re-alignment is the propagation of errors. If error
propagation is not effectively prevented, the number of contaminated pages will increase
quickly, and thus, the number of pages needing to be re-aligned will increase quickly. Error
propagation can be prevented only when the system has very good error detection capability.
This is a subject of our future research.

Appendix A: List of Symbols


$AS^i$, $u$    $AS^i$ is the $i$-th access set in the system. $AS^i$ is essentially a set of memory words
    that will be accessed continuously by the CPU (active agent) for a period of
    time. $u$ is the size of access set.

$f_K$, $\epsilon$    $K$ is the number of recovery pages in the system. $f_K$ is an upper bound for
    $f$, the number of faulty recovery pages, such that $P\left(t_{ra} > K + f\frac{W}{K}\right) < \epsilon$.

$M_k^i$    $M_k^i$ is the length of time that the active agent stays in $AS^k$ during its $i$-th
    visit to $AS^k$.

to to is the number of access sets in the system. 

$N(t)$, $N^i(t)$    $N(t) = \sum_i N^i(t)$, summed over all access sets, where $N^i(t)$ is the number of
    the agent's visits to $AS^i$ by time $t$, and $N(t)$ is the total number of visits to
    access sets by the active agent during $[0, t)$.

NMR_CEM, QMR    NMR_CEM is an N modular redundant system with CEMs embedded into
    each channel. QMR is a quadruplex modular redundant system.

$P_c(Y_i)$, $P_N(t)$    $P_c(Y_i)$ is the probability of a channel becoming faulty during time interval
    $Y_i$. $P_N(t)$ is the probability of system crash caused by multiple channel faults
    during time interval $[0, t)$.

$P_K(F = f)$    $P_K(F = f)$ is the probability of $f$ recovery pages becoming faulty when the
    number of recovery pages is $K$, and $P_K\left(t_{ra} = K + f\frac{W}{K}\right)$ is the probability of
    the re-alignment time $= K + f\frac{W}{K}$.

$T_i$, $Y_i$    $T_i$ is the time the $i$-th vote is held, $Y_i$ is the interval between $T_{i-1}$ and $T_i$.

$V_k^i$, $S_k^i$    $V_k^i$ is the event that the active agent makes the $k$-th visit to $AS^i$. $S_k^i$ is the
    moment the event $V_k^i$ begins.

$X_j$    $X_j$ is a random variable denoting the number of fault occurrences to $AS^i$
    during the agent's $j$-th visit to access sets.

$\lambda_a$, $\lambda_p$, $\lambda_i$, $\lambda_m$, $\lambda_o$    $\lambda_a$, $\lambda_p$, $\lambda_i$, $\lambda_m$, $\lambda_o$ are the failure rates of an access set, a processor
    (including control logics), interstage, RAM memory, and ROM memory of each channel
    in the system, respectively.

$\mu_f$, $\psi$    $\mu_f$ is the ratio test of $\frac{P_K(F = f+1)}{P_K(F = f)}$, $\mu_f = \frac{K-f}{f+1}\left(e^{\psi/K} - 1\right)$, where $\psi$ is the product
    of memory size (words), failure rate of a memory word, number of redundant
    channels, and the time $t$.

$\Pi_E$, $\Pi_Q$    $\Pi_E$ and $\Pi_Q$ are the environmental and quality factors of a component,
    respectively. Component failure rate is adjusted by $\lambda' = \Pi_E \times \Pi_Q \times \lambda$.

$H_i$    $H_i$ is the $i$-th recovery page.

Figure 1: The voting and communication network of computing channels in FTP.

Figure 2: The total probability of nearly-coincident faults with different access set sizes and
mission times.

Figure 3: The Markov model of system reliability.

Figure 4: The unreliability of different systems when SEC/DED codes are embedded into
RAM when (a) $\Pi = 0.1$, and (b) $\Pi = 1$.

Figure 5: The unreliability of different systems when SEC/DED codes are embedded into
ROM and RAM where (a) $\Pi = 0.1$, and (b) $\Pi = 1$.

Figure 6: The probability of FTP having fault masking capability before launching when
SEC/DED codes are embedded into RAM.

Figure 7: Probability distribution functions of memory re-alignment times when $t = 200$
hours, the system has 4 channels, each with 4M words of memory.

Figure 8: The cost function of a system with perfect detection capability, 4 channels, 4M
words, $t = 150$ hours, $\lambda = 0.75 \times 10^{-9}$/hour-word, and $\epsilon = 10^{-5}$. (a) The global plot, and
(b) a blow up of the cost function around the optimal point.